Active Learning based on Random Forest and Its Application to Terrain Classification
Authors
Abstract
Many supervised algorithms have been proposed in the machine learning literature for pattern classification. In many pattern recognition tasks, however, labels are expensive to obtain while a vast amount of unlabeled data is easily available. Training sets also often contain redundant samples, which slow down classifier training without improving classification results. To address this problem, active learning techniques [1][2] select the most valuable samples for manual labeling and use them to train the classifier. Uncertainty, density, and diversity are three of the most important criteria in active learning. Uncertain samples are usually the ones that improve the current classifier most. The most popular uncertainty sampling method is SVMactive [3][4], which selects the sample nearest to the current decision boundary. In density sampling, samples in dense regions are considered representative and informative, and the cluster structure of the unlabeled data is usually exploited to find them. The main weakness of uncertainty and density sampling is that they are unable to exploit the abundance of unlabeled data. The diversity criterion was therefore proposed to select a set of unlabeled samples that are as diverse as possible in the feature space, which reduces the redundancy among the samples selected at each iteration. Recently, some active learning algorithms have combined two criteria to find the optimal samples. In [5], Huang et al. query informative and representative examples based on the min-max view of active learning [6]. Some techniques query a batch of unlabeled samples at each iteration by considering both the uncertainty and diversity criteria [7][8]. Shi et al. [9] proposed a batch-mode active learning method for networked data with three criteria (minimum redundancy, maximum uncertainty, and maximum impact).
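As a rough illustration of uncertainty sampling by distance to the decision boundary, the sketch below picks the pool sample closest to a linear boundary. It is not the SVMactive implementation from [3][4]: the linear boundary `w·x + b = 0`, the function names, and the toy pool are all assumptions made for this example.

```python
import math

def distance_to_hyperplane(x, w, b):
    """Geometric distance from point x to the hyperplane w.x + b = 0."""
    norm = math.sqrt(sum(wi * wi for wi in w))
    return abs(sum(wi * xi for wi, xi in zip(w, x)) + b) / norm

def most_uncertain(pool, w, b):
    """Index of the unlabeled sample closest to the (assumed linear) boundary."""
    return min(range(len(pool)),
               key=lambda i: distance_to_hyperplane(pool[i], w, b))

# Hypothetical unlabeled pool and boundary x1 = 0 (w = (1, 0), b = 0).
pool = [(2.0, 0.0), (0.5, 3.0), (-4.0, 1.0)]
query_index = most_uncertain(pool, w=(1.0, 0.0), b=0.0)
```

Here the second sample lies closest to the boundary, so it would be the one sent for manual labeling under this criterion.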
The processing platform for active learning should be considered as well. Distributed processing systems, among others, are gaining much attention and suit an active learning system that gathers samples from many distributed locations and processes them as one virtual entity. Such a solution was proposed in [10], which describes a system that optimizes processing task allocation in a Peer-to-Peer based computing architecture. A decentralized approach that also supports multiple data sources (suitable for obtaining samples) was shown in [11]. A large number of active learning algorithms are based on SVM and regression classifiers, but there is little work on active learning with a random forest classifier. To our knowledge, DeBarr et al. have explored random forest active learning [12]. In this paper, we propose a novel active learning algorithm based on random forest that selects samples with large uncertainty, density, and diversity for manual labeling. For each unlabeled sample, we use the difference between the largest and second-largest numbers of votes from the random forest classifier to measure its uncertainty. The average distance between the sample and its k-nearest unlabeled neighbors is used to measure the density, while the distance between the sample and its nearest labeled neighbor is used to measure the diversity. The rest of this paper is organized as follows. Section 2 describes the proposed active learning based on random
Y. Gu (*), Computer Science and Engineering, Nanjing University of Science and Technology, Nanjing 210094, China
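The three measures named in the abstract (vote margin, average distance to the k nearest unlabeled neighbors, distance to the nearest labeled neighbor) can be sketched as below. This is a minimal sketch, not the authors' code: the function names are hypothetical, Euclidean distance and normalization of the margin by the number of trees are assumptions, and the way the paper combines the three measures into one selection rule is not stated in this excerpt, so no combination is shown.

```python
import math
from collections import Counter

def vote_margin(tree_votes):
    """Uncertainty: (top votes - second-most votes) / number of trees.
    A smaller margin means the forest is less certain about the sample."""
    counts = sorted(Counter(tree_votes).values(), reverse=True)
    top = counts[0]
    second = counts[1] if len(counts) > 1 else 0
    return (top - second) / len(tree_votes)

def mean_knn_distance(i, unlabeled, k):
    """Density: average Euclidean distance from unlabeled[i] to its k nearest
    unlabeled neighbors. A smaller value means a denser region."""
    dists = sorted(math.dist(unlabeled[i], u)
                   for j, u in enumerate(unlabeled) if j != i)
    nearest = dists[:k]
    return sum(nearest) / len(nearest)

def nearest_labeled_distance(x, labeled):
    """Diversity: Euclidean distance from x to its nearest labeled neighbor.
    A larger value means x differs more from what is already labeled."""
    return min(math.dist(x, l) for l in labeled)

# Hypothetical toy data: per-tree class votes for one sample, and 2-D points.
votes = ["grass", "grass", "road", "sand"]
unlabeled = [(0.0, 0.0), (3.0, 4.0), (6.0, 8.0)]
labeled = [(3.0, 4.0), (1.0, 0.0)]

margin = vote_margin(votes)
density = mean_knn_distance(0, unlabeled, k=2)
diversity = nearest_labeled_distance(unlabeled[0], labeled)
```

In practice the per-tree votes would come from a trained random forest; here they are written out by hand to keep the sketch self-contained.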
Similar resources
A Random Forest Classifier based on Genetic Algorithm for Cardiovascular Diseases Diagnosis (RESEARCH NOTE)
Machine learning-based classification techniques provide support for the decision making process in the field of healthcare, especially in disease diagnosis, prognosis and screening. Healthcare datasets are voluminous in nature and their high dimensionality problem comprises in terms of slower learning rate and higher computational cost. Feature selection is expected to deal with the high dimen...
Semi-Supervised Learning Based Prediction of Musculoskeletal Disorder Risk
This study explores a semi-supervised classification approach using random forest as a base classifier to classify the low-back disorders (LBDs) risk associated with the industrial jobs. Semi-supervised classification approach uses unlabeled data together with the small number of labelled data to create a better classifier. The results obtained by the proposed approach are compared with those o...
Application of ensemble learning techniques to model the atmospheric concentration of SO2
In view of pollution prediction modeling, the study adopts homogenous (random forest, bagging, and additive regression) and heterogeneous (voting) ensemble classifiers to predict the atmospheric concentration of Sulphur dioxide. For model validation, results were compared against widely known single base classifiers such as support vector machine, multilayer perceptron, linear regression and re...
3D Detection of Power-Transmission Lines in Point Clouds Using Random Forest Method
Inspection of power transmission lines using classic expert-based methods suffers from disadvantages such as high time and money consumption. The advent of UAVs and their application in aerial data gathering help to decrease the time and cost prominently. The purpose of this research is to present an efficient automated method for inspection of power transmission lines based on point c...
Forest Stand Types Classification Using Tree-Based Algorithms and SPOT-HRG Data
Forest type mapping is one of the most necessary elements in forest management and silviculture treatments. Traditional methods such as field surveys are time-consuming and cost-intensive. Improvements in remote sensing data sources and classification-estimation methods are providing new opportunities for obtaining more accurate maps of forest biophysical attributes. This research co...
Publication date: 2014